Max Pellert (https://mpellert.at)
Deep Learning for the Social Sciences
| Date | Topic | Who? |
|---|---|---|
| 28.5. | No class | |
| 4.6. | Generative Deep Learning 1 | Giordano |
| 11.6. | NLP 1 | Max |
| 18.6. | NLP 2 | Max |
| 25.6. | Reinforcement Learning | Giordano |
| 2.7. | Project presentation session | Max |
| 9.7. | Large Language Models | Max |
| 16.7. | Generative Deep Learning 2 | Giordano |
Complementary in many aspects to “Understanding Deep Learning”
For the next lectures: check out Chapter 12 on transformer architectures
Online copy here: https://www.bishopbook.com/
PDF Ebook provided by the university library
“Natural language processing (NLP) concerns itself with the interaction between natural human languages and computing devices. NLP is a major aspect of computational linguistics, and also falls within the realms of computer science and artificial intelligence.” https://www.kdnuggets.com/2017/02/natural-language-processing-key-terms-explained.html
“NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.” https://huggingface.co/course/
The following is a list of some of the common NLP tasks, with some examples of each:
Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context
Generating a new sentence from an input text: Translating a text into another language, summarizing a text
NLP isn’t strictly limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.
→ For many tasks that involve more than simple word counting and matching you need complex NLP models that are usually based on deep learning
If we consider the following NLP workflow
We need some concepts to put all of this together
First, we break up the text that we want to feed into the model in its single elements, i.e. we tokenize
Texts are sequences of characters and other symbols
To work with those sequences, for many tasks we have to separate them into their individual units
Those units are called tokens
The process of separating the tokens in a sequence is called tokenization
Many different strategies, the simplest is to tokenize on whitespaces
Often, a token is understood as referring to one word, but in fact it is a more general concept: an emoji can also be a token for example
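The simplest strategy, tokenizing on whitespace, can be sketched in a few lines of Python (a toy illustration, not how production tokenizers work):

```python
text = "We tokenize on whitespace! Even an emoji 🙂 can be a token."

# Naive whitespace tokenization: split the character sequence on spaces
tokens = text.split()
print(tokens)
# Note that punctuation stays attached to words ("whitespace!"),
# which is why real tokenizers add extra rules for symbols
```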
Apart from tokenization, we usually don’t do any further pre-processing steps with complex NLP models based on deep learning
In fact, with complex models, filtering out certain tokens such as stopwords (as is often done with simpler, dictionary-based approaches) can even be counterproductive because it can destroy context
We are going beyond simple approaches and, eventually, we are going to abandon the bag-of-words assumption: from now on, word order matters
Here, we are seeing texts not as bag of words, but as collection of tokens with informative order
A token is always contextual, depending on its specific location in the collection of tokens
Our goal is to create vector representations of ordered sequences of tokens that can be used as model inputs, while removing as little information as possible from the original sequence
At first glance, tokenizing seems to be a rather neutral operation, but check out this informative video to get a feeling for the unexpected effects tokenization can have:
Assume that we have tokenized a sequence of text that we want to feed as input to the model, by splitting up on whitespaces and adding a few rules how to handle symbols such as ‘!’, ‘.’ or ‘,’
To get from an ordered collection of text tokens to a first vector representation, we can simply replace each token by its index (= row number) in the list of all tokens that the model knows, the vocabulary
We will look at ways to construct the vocabulary a little later
For now, assume we have already constructed such a vocabulary
| Index | Vocabulary Entry |
|---|---|
| 1 | a |
| 2 | aardvark |
| … | … |
| 7 | an |
| 8 | ant |
| 9 | antelope |
| 10 | anvil |
| 11 | are |
| 12 | aspic |
| 13 | ate |
| … | … |
an aardvark ate an ant → [7, 2, 13, 7, 8]
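Using the toy vocabulary from the table above, the lookup can be sketched as (a minimal illustration; in practice the tokenizer handles this):

```python
# Toy vocabulary from the table above: token -> index (row number)
vocab = {"a": 1, "aardvark": 2, "an": 7, "ant": 8, "antelope": 9,
         "anvil": 10, "are": 11, "aspic": 12, "ate": 13}

tokens = "an aardvark ate an ant".split()
ids = [vocab[t] for t in tokens]
print(ids)  # [7, 2, 13, 7, 8]
```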
Depending on the model architecture and the tokenizer used, the vocabulary will also contain special tokens
Examples are:
[EOS] … denotes the end of a sequence
[SEP] … used for training with sequence pairs, denotes the end of the first sequence
[CLS] … used for classification tasks, aggregates information about the whole sequence
[MASK] … used to “hide” one token by replacing it
Using all unique tokens in a text that were separated by heuristics such as splitting on whitespace and similar rules would result in a very large vocabulary
This will affect performance because of hard- and software constraints
Often, some tokens are very rare and uninformative, for example usernames (see SolidGoldMagikarp)
This approach would lead to an extremely large vocabulary that can be used to represent token sequences with just one integer per token
The other extreme would be to use only unique characters as the vocabulary
Character tokenization leads to a small vocabulary but produces long sequences to represent even individual words
For example the word “clown” alone would need to be represented by 5 integers, which also leads to performance bottlenecks
With that approach, it has proven very hard for a model to learn a useful representation from individual letters such as “c” instead of whole words such as “clown”
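Character tokenization can be sketched as follows (the a → 1, b → 2, … index mapping is just an assumption for illustration):

```python
word = "clown"
char_tokens = list(word)                    # ['c', 'l', 'o', 'w', 'n']
# Toy character vocabulary: a -> 1, b -> 2, ..., z -> 26
ids = [ord(c) - ord("a") + 1 for c in char_tokens]
print(ids)  # five integers needed to represent a single word
```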
For these reasons, a middle way is often used: subword tokenization to build the vocabulary.
There are different algorithms (Byte-Pair Encoding (BPE), WordPiece, Unigram, …) that work in specific ways to achieve one goal: Frequent words should be included in the vocabulary as-is (“clown”), but not so frequent words should be combinations of words in the vocabulary (“clown” + “ish”)
Parts of the vocabulary that are added to other words are prefixed, for example “##ish”.
Let’s have a look at one of the algorithms to achieve that…
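As a sketch of the idea behind Byte-Pair Encoding (BPE): start from single characters and repeatedly merge the most frequent adjacent pair of symbols until the desired vocabulary size is reached. A minimal illustration with made-up data, ignoring details such as byte-level handling and the “##” continuation prefix:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE-style merges from a list of words (toy illustration)."""
    # Start with each word as a sequence of single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = bpe_merges(["clown", "clown", "clownish", "fish"], num_merges=4)
print(merges)
```

Frequent character combinations (here the pieces of “clown”) get merged first, so frequent words end up in the vocabulary as-is while rare words remain split into subword pieces.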
Goal: Tokenizer chooses its input “units”
To recap, some things to consider:
Solution: Sub-word tokenization
We set a stopping criterion at the desired size of the vocabulary (for example 50 000 or 150 000 tokens)
Provides a tradeoff between vocabulary size and size of integers necessary to represent a tokenized sequence
The vocabulary embeddings can actually just be randomly initialized in the beginning and be trained like other parameters
Or pretrained embeddings, e.g. from Word2Vec or GloVe can be used
Dimensionality is specified according to the model architecture, for example 1024 dimensions or more
This is one of the hyperparameters of NLP models
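The embedding lookup can be sketched with NumPy (toy sizes; the realistic figures mentioned above, e.g. a vocabulary of 50 000 tokens with 1024 dimensions, would only change the shapes):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
vocab_size, dim = 14, 8          # toy sizes; real models use e.g. 50000 x 1024
# Randomly initialized embedding table: one row per vocabulary entry,
# later trained like any other model parameter
embeddings = rng.normal(scale=0.02, size=(vocab_size, dim))

token_ids = [7, 2, 13, 7, 8]     # "an aardvark ate an ant" as indices
inputs = embeddings[token_ids]   # shape: (sequence_length, dim)
print(inputs.shape)              # (5, 8)
```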
Another one concerns the batch size: how many individual sequences are passed in one iteration to the model, usually more is better here (and the limitation is often GPU memory)
Individual sequences longer than x tokens (for example 512 for BERT, or more for later models) often have to be truncated; shorter sequences are usually padded with a special token (typically index 0) up to the maximum length (padding)
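Truncation and padding can be sketched as follows (assuming 0 is the padding index):

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    # Cut off sequences longer than max_len, fill shorter ones with pad_id
    kept = token_ids[:max_len]
    return kept + [pad_id] * (max_len - len(kept))

print(pad_or_truncate([7, 2, 13, 7, 8], max_len=8))  # [7, 2, 13, 7, 8, 0, 0, 0]
print(pad_or_truncate([7, 2, 13, 7, 8], max_len=3))  # [7, 2, 13]
```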
Let’s take a look at hyperparameters of a widely used model (that has this kind of information still publicly available)
Now we have our inputs ready, let’s consider the model part
Let’s take a step back and consider we have a collection of tokens in a corpus
The simplest approach: we could count the frequency of each token and record that in a table
Let’s consider this still as a bag of words (no order), and create a sequence by consulting the table:
Such approaches can be used, for example with Naive Bayes classifiers, where we assume tokens are independent within each class \(C_k\) but with a different distribution for each class
We can estimate the prior \(p(C_k)\) and the class-conditional densities \(p(x_n \mid C_k)\) from an annotated training data set:
\[ p(C_k \mid x_1, \ldots, x_N) \propto p(C_k) \prod_{n=1}^{N} p(x_n \mid C_k) \]
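As a sketch, such a bag-of-words Naive Bayes classifier can be built from frequency counts alone; the training data below is made up for illustration, and add-one (Laplace) smoothing stands in for the more general smoothing schemes used in practice:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors and per-class token counts from training data."""
    priors = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        token_counts[y].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}
    return priors, token_counts, vocab

def predict_nb(tokens, priors, token_counts, vocab):
    """Pick the class maximizing log prior + summed token log-likelihoods."""
    n_docs = sum(priors.values())
    best, best_lp = None, -math.inf
    for y in priors:
        lp = math.log(priors[y] / n_docs)
        total = sum(token_counts[y].values())
        for t in tokens:
            # Add-one (Laplace) smoothing also handles unseen tokens
            lp += math.log((token_counts[y][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Toy annotated training set (made up for illustration)
docs = [["good", "movie"], ["bad", "movie"], ["good", "fun"]]
labels = ["pos", "neg", "pos"]
priors, token_counts, vocab = train_nb(docs, labels)
print(predict_nb(["good", "movie"], priors, token_counts, vocab))  # pos
```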
While such approaches can be useful, for example to create comparison baselines, we want to use the information contained in word order
We could represent each term on the right-hand side by a table whose entries are as before estimated using simple frequency counts from the training corpus
However, the size of these tables grows exponentially with the length of the sequence, and so this approach would become prohibitively expensive
We need to find a cure against this computational intractability
One way would be to simplify the model dramatically by assuming that each of the conditional distributions on the right-hand side is independent of all previous observations except the \(L\) most recent words
For example, if \(L = 2\), then the joint distribution for a sequence of \(N\) observations under this model is given by
\[ p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2}) \]
With \(L = 1\) we would call this a bi-gram model because it depends on pairs of adjacent words
\(L = 2\) involves triplets of adjacent words, is called a tri-gram model, and in general these are called n-gram models
All the models discussed so far in this part can be run generatively to synthesize novel text
If we for example provide the first and second words in a sequence, then we can sample from the tri-gram statistics \(p(x_n |x_{n−1} , x_{n−2} )\) to generate the third word, and then we can use the second and third words to sample the fourth word, and so on
The resulting text, however, will be incoherent because each word is predicted only on the basis of the two previous words
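A minimal sketch of this kind of generative sampling from n-gram counts, here with bigrams (\(L = 1\)) for brevity and a made-up toy corpus:

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count bigram statistics: how often word b follows word a."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, n_words, seed=0):
    """Sample a continuation word by word from the bigram counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_words):
        followers = counts.get(out[-1])
        if not followers:  # dead end: word never seen with a successor
            break
        words, freqs = zip(*followers.items())
        out.append(rng.choices(words, weights=freqs)[0])
    return out

corpus = "an aardvark ate an ant and an anvil".split()
counts = train_bigram(corpus)
print(generate(counts, "an", n_words=5))
```

Each word is sampled conditioned only on its immediate predecessor, which makes the incoherence of short-context n-gram text easy to see even on a toy corpus.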
High-quality text models must take account of the long-range dependencies in language
In addition, n-gram approaches don’t generalize well, for example to tokens that are not included in their training corpus (here usually some smoothing by adding small probabilities to such unknown tokens is used)
At the same time, we want to avoid the exponential growth in the number of parameters of an n-gram model because of computational constraints
Building huge tables is also generally not efficient, because even semantically very close sequences are recorded separately on their own, for example sentences that are exactly the same except for a synonym
A proposed solution should also account for that redundancy in natural language by making use of shared parameters
These early neural language models (NLMs) have intuitive limitations:
They ignore the global context provided by prefix tokens further than \(k\) tokens away (and typically only \(k = 5\))
They use a different set of parameters for each position in the prefix window
They have a relatively small number of parameters, which limits their expressiveness.
One of the problems with standard RNNs is that they still deal poorly with long-range dependencies
This is especially problematic for the domain of natural language where such dependencies are widespread
In a long passage of text, a concept might be introduced that plays an important role in predicting words occurring much later in the text
At each new step of the RNN, information from earlier steps gradually “washes out”
Also, with such an RNN approach, the entire concept of the English sentence must be captured in the single hidden vector \(\mathbf{z}^*\) of fixed length
The network can start to generate the output translation only once the full input sequence has been processed
This becomes increasingly problematic with longer sequences
This is known as the bottleneck problem because it means that a sequence of arbitrary length has to be summarized in a single hidden vector of activations